-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Counts the number of lines correctly for files with certain multibyte encodings #1211
Counts the number of lines correctly for files with certain multibyte encodings #1211
Conversation
Thanks for the quick fix! ⚡ |
np, was a fun one to track down. I'd still like some 👀 and a 👍 from @github/encodings before I try this out. |
nice catch! 👍 |
It looks like it's valid to call this method even if `binary?` is true. Encoding as 'ASCII-8BIT' should always succeed.
Awkwardly enough, we've made an implicit guarantee that the data exposed through For now, I'll make a change forcing each line back to the original encoding unless someone has a better idea. |
encoded_newlines = ["\r\n", "\r", "\n"]. | ||
map { |nl| nl.encode(encoding).force_encoding(data.encoding) } | ||
|
||
data.split(Regexp.union(encoded_newlines), -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New strategy that keeps each entry in lines
the same encoding as data
, which (for GitHub blobs) is ASCII-8BIT
. Seem OK, @github/encodings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hot, I like that
Counts the number of lines correctly for files with certain multibyte encodings
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in github-linguist#1211). So the error would only be seen in wild usage (see issue github-linguist#353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings.
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the error would only be seen in wild usage (see issue #353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings.
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the error would only be seen in wild usage (see issue #353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings.
* Prepare 7.9.0 release * Put back the v7.8.0 version We need this to ensure the versioning used during testing on GitHub.com doesn't cause caching problems in future * fix errors on non-UTF-8 encodings Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the error would only be seen in wild usage (see issue #353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings. * Decrease expected error count * Set version to 7.9.0 Co-authored-by: Rick Winfrey <[email protected]> Co-authored-by: Santiago M. Mola <[email protected]>
Currently,
loc
andsloc
are incorrect for UTF-16 encoded files with Windows line endings (example). This is becausedata
and the newline regular expression are encoded as ASCII-8BIT or UTF-8, even if the encoding is detected as something else. In the case of UTF-16LE, Windows newlines are encoded as"\r\0\n\0"
. Because of the\0
separating the\r
and\n
, the current regular expression(\r|\n)
counts two lines even though it should only be interpreted as one single line break.The symptom is that UTF-16LE encoded files with Windows line endings are rendered with whitespace at the bottom that doesn't actually exist because the line count is calculated as twice what it should be.
/cc: @github/encodings